-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: use new parquet library #103293
Conversation
16142f4
to
d926bd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/ccl/changefeedccl/changefeedbase/options.go
line 1061 at r2 (raw file):
// Right now parquet does not support any of these options if s.m[OptFormat] == string(OptFormatParquet) { if isPredicateChangefeed {
Enabling parquet + predicate changefeeds + diff was added in #94653, but there's not much of an explanation for why. If you look at the parquetPossible
randomized logic in pkg/ccl/changefeedccl/testfeed_test.go
, it looks like we disabled testing the parquet format with diff
. I've decided to leave diff
unsupported until we add it explicitly with proper testing in #103129.
d926bd5
to
ddb4762
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to address the fact that tuples are not supported; this breaks cdc queries, and cdc_prev tuple.
cdc_prev also turns on "diff" option.
pkg/ccl/changefeedccl/parquet.go
Outdated
if includeParquetTestMetadata { | ||
if opts, err = addCDCTestMetadata(row, opts); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can get rid of includeParquetTestMetadata altogether? LIke at the call site (testfeed?) if we decide to use parquet, then.. just explicitly specify additional metadata?
if typ.ArrayContents().Family() == types.ArrayFamily { | ||
return result, pgerror.Newf(pgcode.FeatureNotSupported, | ||
"parquet writer does not support nested arrays") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay to merge. Supporting tuples (needed for cdc_prev) is on the list of items to complete before making the new library public #99028. I'm working on those changes concurrently to this PR and plan to merge them next. There are some issues with the old library, so my changes are strictly better than what is in the main branch IMO:
The old library "supports" cdc_prev, tuples, and diff, but they don't work. You can create a changefeed ex. create changefeed into 'nodelocal://0/baz' with format='parquet', diff as select *, cdc_prev from foo;
. The cdc_prev
tuple causes an error, because tuples are not supported. If you pass diff
, parquet does not emit the diff. The reason for this surprising behavior is that we disable parquet in tests that use diff
, but enable parquet in production with diff
when using cdc queries. My change explicitly disallows these cases in production and testing. I will incrementally enable them as I add more features.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @miretskiy)
pkg/ccl/changefeedccl/parquet.go
line 67 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I wonder if we can get rid of includeParquetTestMetadata altogether? LIke at the call site (testfeed?) if we decide to use parquet, then.. just explicitly specify additional metadata?
Can you please clarify this? The metadata contains information from the rows (the keys and non keys) which are used in testing ex. assertPayloads([]string{`[key] -> {"a": 1}`})
. I think this function is the best place to get that information easily.
pkg/util/parquet/schema.go
line 350 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
why not?
A few reasons.
If I understand correctly, we don't support multidimensional arrays, even in projections, so we should never see them. @HonoreDB also tried a few cases with me, but we couldn't make a nested array.
select array[[1,2,3],[4,5,6]] as col;
ERROR: unimplemented: arrays cannot have arrays as element type
SQLSTATE: 0A000
HINT: You have attempted to use a feature that is not yet implemented.
See: https://go.crdb.dev/issue-v/32552/v23.1
The code I wrote definitely does not support nested arrays, it assumes arrays are arrays of scalar, so I just made the assertion explicit.
I'm fairly confident arbitrary nesting is not supported in parquet because I could not find anything about it in the docs/spec or by searching online. At best, there are some hacks (https://medium.com/hadoop-noob/recursive-avro-schema-for-parquet-a87c84b6aca1), but I don't feel confident relying on them. Chatgpt does not recommend it either :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay; just for my info: what kind of error?
If you don't have explicit issue, could you file one to get cdc_prev/diff/parquet working in a sane way?
Reviewed 1 of 13 files at r1, 1 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @jayshrivastava)
pkg/ccl/changefeedccl/parquet.go
line 114 at r3 (raw file):
// addCDCTestMetadata appends options to the provided options to configure the // parquet writer to write metadata required by cdc test feed factories. func addCDCTestMetadata(row cdcevent.Row, opts []parquet.Option) ([]parquet.Option, error) {
nit: perhaps addParquetTestMetadata?
pkg/ccl/changefeedccl/parquet.go
line 67 at r4 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Can you please clarify this? The metadata contains information from the rows (the keys and non keys) which are used in testing ex.
assertPayloads([]string{`[key] -> {"a": 1}`})
. I think this function is the best place to get that information easily.
I was just curious if we can explicitly pass an argument indicating whether or not we should include test metadata.
Perhaps rely on some knobs -- like if changefeed knob is non-nil, then this is test code, and so, add this metdata.
I understand why we need this metadata -- just not sure we need this global boolean flag to achieve this.
Feel free to ignore if too much hassle.
pkg/ccl/changefeedccl/changefeedbase/options.go
line 1061 at r2 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Enabling parquet + predicate changefeeds + diff was added in #94653, but there's not much of an explanation for why. If you look at the
parquetPossible
randomized logic inpkg/ccl/changefeedccl/testfeed_test.go
, it looks like we disabled testing the parquet format withdiff
. I've decided to leavediff
unsupported until we add it explicitly with proper testing in #103129.
As I recall it, it was made purely to make tests work -- we randomly pick parquet sink sometimes.
pkg/util/parquet/schema.go
line 350 at r4 (raw file):
Previously, jayshrivastava (Jayant) wrote…
A few reasons.
If I understand correctly, we don't support multidimensional arrays, even in projections, so we should never see them. @HonoreDB also tried a few cases with me, but we couldn't make a nested array.
select array[[1,2,3],[4,5,6]] as col; ERROR: unimplemented: arrays cannot have arrays as element type SQLSTATE: 0A000 HINT: You have attempted to use a feature that is not yet implemented. See: https://go.crdb.dev/issue-v/32552/v23.1
The code I wrote definitely does not support nested arrays, it assumes arrays are arrays of scalar, so I just made the assertion explicit.
I'm fairly confident arbitrary nesting is not supported in parquet because I could not find anything about it in the docs/spec or by searching online. At best, there are some hacks (https://medium.com/hadoop-noob/recursive-avro-schema-for-parquet-a87c84b6aca1), but I don't feel confident relying on them. Chatgpt does not recommend it either :)
Ack. Thanks for the explanation. PostgreSQL supports multi dimensional arrays, and I think it's a matter of time before we do too.
I wonder if we should have a fall back mechanism i this library: if unsure -- convert to JSON. Every type must be convertable to json.
At any array, could you add a bit of a commentary here, and link to
https://go.crdb.dev/issue-v/32552/v23.1
pkg/ccl/changefeedccl/testfeed_test.go
line 1347 at r3 (raw file):
} keyBuf := bytes.Buffer{}
var keyBuf bytes.buffer
?
This change updates changefeeds to use the new parquet library added in `pkg/util/parquet` when using `format=parquet`. Informs: cockroachdb#99028 Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071 Release note: None
Epic: none Release note: None
ddb4762
to
63eccd4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error is transient error: parquet export does not support the TupleFamily type yet
. At least it's a sane error. With this change, you won't even be able to create the changefeed. I'm working on adding tuples now - #103589.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @miretskiy)
pkg/ccl/changefeedccl/parquet.go
line 114 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
nit: perhaps addParquetTestMetadata?
Done.
pkg/ccl/changefeedccl/parquet.go
line 67 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I was just curious if we can explicitly pass an argument indicating whether or not we should include test metadata.
Perhaps rely on some knobs -- like if changefeed knob is non-nil, then this is test code, and so, add this metdata.
I understand why we need this metadata -- just not sure we need this global boolean flag to achieve this.
Feel free to ignore if too much hassle.
Done. I made it a testing knob.
pkg/util/parquet/schema.go
line 350 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Ack. Thanks for the explanation. PostgreSQL supports multi dimensional arrays, and I think it's a matter of time before we do too.
I wonder if we should have a fall back mechanism i this library: if unsure -- convert to JSON. Every type must be convertable to json.
At any array, could you add a bit of a commentary here, and link to
https://go.crdb.dev/issue-v/32552/v23.1
Ack. Will implement this as discussed IRL. Working on adding tuple support as well.
pkg/ccl/changefeedccl/testfeed_test.go
line 1347 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
var keyBuf bytes.buffer
?
Done.
bors r=miretskiy |
Build succeeded: |
This change updates changefeeds to use the new parquet library
added in
pkg/util/parquet
when usingformat=parquet
.Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None